THE SAFE USE OF SYNTHETIC DATA IN CLASSIFICATION
Program Proposal for Ph.D. in Computer Science
Draft for Comments
Abstract
When is it safe to use synthetic data in supervised classification? Trainable classifier technologies require large, representative training sets consisting of samples labeled with their true class. Acquiring such training sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. Of course, this immediately raises many additional questions, perhaps the first being “Why should we trust artificially generated data to be an accurate representation of the real distributions?” Other questions include “When will the use of synthetic data work as well as real data?” and “Can synthetic data produce better results than training only on real data?” We distinguish between sample space (the set of real samples), generator space (all samples that can be generated synthetically), and feature space (the set of samples expressed as finite numerical values). In this proposal, we discuss a small experiment in which we produced the synthetic data in feature space, that is, by varying three of the features (the R, G, and B components) of the data files. We showed that for some optimal number of seed files, the performance of the synthetic data matched that of the real data. However, there
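As an illustration only, generating synthetic samples in feature space by varying the R, G, and B components of seed samples might look like the sketch below. The abstract does not say how the features were varied, so everything here is an assumption: the function name `synthesize_rgb`, the 0 to 255 channel range, the uniform integer jitter bounded by `max_shift`, and the rule that each synthetic sample inherits its seed's label.

```python
import random


def synthesize_rgb(seeds, per_seed=5, max_shift=10, rng=None):
    """Return synthetic (features, label) pairs made by jittering seed R, G, B values.

    Hypothetical helper: the perturbation scheme is assumed, not taken
    from the proposal.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for (r, g, b), label in seeds:
        for _ in range(per_seed):
            # Shift each channel by a random integer in [-max_shift, max_shift],
            # then clamp to the assumed 8-bit range.
            jittered = tuple(
                min(255, max(0, c + rng.randint(-max_shift, max_shift)))
                for c in (r, g, b)
            )
            # Assumption: a synthetic sample keeps the class label of its seed.
            synthetic.append((jittered, label))
    return synthetic


# Two made-up seed samples, each a (R, G, B) feature triple with a class label.
seeds = [((200, 30, 40), "red"), ((20, 40, 210), "blue")]
samples = synthesize_rgb(seeds, per_seed=3, max_shift=10, rng=random.Random(42))
```

In a setup like this, the number of seeds and `per_seed` would be the quantities to vary when comparing a classifier trained on synthetic samples against one trained on real samples only.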
Publication date: 2006